Abstract

We propose a novel parameterized family of Mixed Membership Mallows
Models (M4) to account for variability in pairwise comparisons generated
by a heterogeneous population of noisy and inconsistent users. M4 models
individual preferences as a user-specific probabilistic mixture of shared
latent Mallows components. Our key algorithmic insight for estimation is
to establish a statistical connection between M4 and topic models by
viewing pairwise comparisons as words, and users as documents. This
insight leads us to explore Mallows components with a separable structure
and leverage recent advances in separable topic discovery. While
separability appears to be overly restrictive, we nevertheless show that
it is an inevitable outcome of a relatively small number of latent
Mallows components in a world with a large number of items. We then
develop an algorithm based on robust extreme-point identification of
convex polygons to learn the reference rankings; the algorithm is
provably consistent with polynomial sample complexity guarantees. We
demonstrate that our new model is empirically competitive with the
current state-of-the-art approaches in predicting real-world preferences.

1. Introduction 

The problem of predicting preferences for a diverse user population
arises in many applications including personal recommendation systems,
e-commerce and information retrieval (Volkovs & Zemel, 2014;
Lu & Boutilier, 2014; Ding et al., 2015). Pairwise comparisons of items
by a heterogeneous and inconsistent population can now be observed and
recorded over the web through transactions, clicks and check-ins for a
large set of items. Our goal is to model, infer, and predict user
behavior in pairwise comparisons.

This paper proposes a new Mixed Membership Mallows Model (M4) for
pairwise comparisons that leverages the widely used mixture of Mallows
model (e.g., Lu & Boutilier, 2014; Awasthi et al., 2014). The building
block of M4 is the popular Mallows distribution on permutations. The pmf
of the Mallows model is centered around a reference ranking and the
deviation is captured by a dispersion constant (Mallows, 1957). M4
naturally captures heterogeneous, inconsistent, and noisy behavior by
modeling each user's comparisons as a probabilistic mixture of a few
shared latent Mallows components. By design, the latent Mallows
components capture the heterogeneous influencing factors in the
population and the user-specific mixing weights reflect the influence of
multiple latent factors on each user. Furthermore, the randomness of each
Mallows component captures the fact that the same latent factor can
result in different outcomes for different users, more so for very
similar items. Overall, M4 generalizes the clustering perspective in the
mixture of Mallows model into a decomposition modeling perspective that
better fits the emerging web-scale observations.

The key contribution in this paper is to propose the first provable and
polynomially efficient approach for learning multiple Mallows components
in mixed membership settings from pairwise comparisons. As a special case
of M4, the mixture of Mallows model has received significant attention
(Lebanon & Lafferty, 2002; Busse et al., 2007; Lu & Boutilier, 2014;
Awasthi et al., 2014), yet theoretical guarantees are not clear except
for special cases (Awasthi et al., 2014). We propose to learn M4 by
reducing it to an instance of probabilistic topic modeling (Blei, 2012).
Topic modeling for text corpora has been extensively studied but its
connection to preference data is unclear. We view users as "documents",
pairwise comparisons as "words", and the latent Mallows components as
"topics". This leads us to the question of topic discovery viewed within
the context of M4.


The key technical contribution of our approach is to provably discover
latent factors with a non-exact separability structure. Our approach is
geometrically inspired by the recent work in exact separable topic
discovery (e.g., Arora et al., 2013; Ding et al., 2013), and we provably
generalize it to approximate separability with a finite degree of
deviation. In M4, this requires that, for each Mallows component, there
exists an item pair such that item A is preferred over B with very high
probability in that Mallows component and B is preferred over A with high
probability under the other Mallows components. While it might appear
restrictive, we show formally that approximate separability is inevitable
and naturally arises from the fact that we have a large set of items
relative to the number of shared latent preferences. As a consequence,
most large M4 models are approximately separable. We then provably
generalize the geometric property of the solid angle from
(Ding et al., 2014b) and establish guarantees for consistent estimation
of reference rankings along with polynomial sample and computational
complexity bounds. Our results only require the number of users to scale
while allowing for the number of comparisons per user to be small.

1.1. Related work 

Rank estimation from full or partial preferences has been extensively
studied in different settings for decades (Marden, 1995;
Rajkumar & Agarwal, 2014; Volkovs & Zemel, 2014). The family of mixtures
of ranking models has demonstrated superior modeling power to capture a
heterogeneous population with noisy observations (e.g.,
Farias et al., 2009; Oh & Shah, 2014). In these models, each user is
associated with one ranking component sampled from a set of multiple
ranking components; hence the population can be clustered into
heterogeneous preference types. The mixture of Mallows model has received
significant attention (Lebanon & Lafferty, 2002; Busse et al., 2007;
Lu & Boutilier, 2014; Awasthi et al., 2014). EM-based algorithms have
been used for estimation from pairwise comparisons (Lu & Boutilier, 2014)
or full rankings (Busse et al., 2007). Only recently,
(Awasthi et al., 2014) proposed a provably correct algorithm based on
tensor decomposition that can handle a mixture of two Mallows models
using the top-3 ranked items as the observations which, in effect,
requires users to consider all items. This is impractical within the
context of the target web-scale applications. Since the mixture of
Mallows is a special case of M4 obtained by positing a specific prior on
each user's mixing weights, our algorithm can thus be viewed as providing
a powerful alternative approach for learning the mixture of Mallows
model. We note that mixtures of Bradley-Terry-Luce (BTL) models
(Oh & Shah, 2014) and mixtures of Plackett-Luce (PL) models
(Azari Soufiani et al., 2013) have also been studied.

Our model is closely related to (Ding et al., 2014a; 2015), which
validated the advantages of adopting the mixed membership perspective.
Ding et al. (2015) models each latent ranking factor as a single
permutation, which is a degenerate special case of the Mallows
distribution over permutations in M4. Therefore, while both
(Ding et al., 2015) and M4 can capture the inconsistent behavior stemming
from the influence of multiple latent factors, M4 can further account for
the inconsistency as the consequence of the randomness within each
Mallows component. Our approach has similar polynomial time and sample
guarantees as in (Ding et al., 2015). We note that, motivated by social
choice applications, Gormley & Murphy (2008) proposed another mixed
membership ranking model where the latent "topics" are PL models. An
MCMC-based approach is used for estimation without theoretical
guarantees. Table 1 summarizes all the closely related works.

Connection to Separable Topic Discovery: A key motivation of our approach
is the recent work on consistent and efficient topic discovery for topic
matrices that have an exact separable structure (Arora et al., 2013;
Ding et al., 2014b). The exact separability has been exploited as a
suitable approximation to many problems including topic modeling
(Arora et al., 2013) and ranking estimation (Ding et al., 2015).

Closely related to our technical settings is the so-called near-separable
structure where the observations are viewed as a noisy perturbation from
some exact separable statistic. In the literature to date, establishing
provable guarantees requires the perturbation to go to zero via either
data augmentation (Arora et al., 2013; Ding et al., 2013; 2015) or
improving the Signal-to-Noise Ratio (Gillis & Vavasis, 2014;
Benson et al., 2014). In contrast, the ideal statistic in our approach
has a small but finite perturbation from the exactly separable ideal. Our
provable guarantees require only a finite degree of approximate
separability. We explicitly derive a sufficient condition that bounds the
degree of approximate separability.

Bansal et al. (2014) recently proposed a provable approach that requires
similar approximate separability as in our settings but also requires a
strong condition on the weight prior. In M4, it requires each user to
have a dominant latent factor. In contrast, we only require the
second-order moments of the prior to be full rank, which is satisfied by
many prior distributions (Arora et al., 2013).

Rating based methods: Considerable work in preference prediction has
focused on numerical ratings. The most important idea is also to model
the ratings as being influenced by a small number of latent factors
shared by the population (e.g., Salakhutdinov & Mnih, 2008a). Although
coming from a different feature space, our model shares the same mixed
membership modeling perspective.
Table 1. Comparison to closely related works. "Vertices" denotes a prior that has non-zero probability only on the vertices of a simplex.

Method                     Observation   Ranking component     Prior          Consistency     Computation
                           type          ("Topic")             distribution   result          complexity
M4                         pairwise      Mallows               general        provable        polynomial
Ding et al. (2015)         pairwise      single ranking        general        provable        polynomial
Gormley & Murphy (2008)    full          Plackett-Luce         Dirichlet      not available   not available
Farias et al. (2009)       pairwise      single ranking        vertices       provable        combinatorial
Lu & Boutilier (2014)      pairwise      Mallows               vertices       not available   not available
Awasthi et al. (2014)      top-3 rank    Mallows               vertices       provable        polynomial
Oh & Shah (2014)           pairwise      Bradley-Terry-Luce    vertices       provable        polynomial


The rest of the paper is organized as follows. Section 2 introduces the
M4 model. In Sec. 3, we formally introduce approximate separability and
show that an M4 model is approximately separable with overwhelming
probability. Section 4 summarizes the steps of our algorithm and the
computational and sample complexity bounds. We demonstrate competitive
performance on semi-synthetic and real-world datasets in Sec. 5.

2. Mixed Membership Mallows Model 



Figure 1. Graphical representation of the proposed Mixed Membership
Mallows Model. The boxes represent replicates. The bottom outer plate
represents users, and the inner plate represents ranking tokens and
comparisons of each user.

We now describe the generative process of the Mixed Membership Mallows
Model (M4). To set up the problem, we consider a universe of Q items
$U = \{1, \ldots, Q\}$ and a population of M users that each compares
N > 2 pairs of items. We assume the item pairs to be compared, denoted by
un-ordered pairs $\{i, j\}$, are drawn independently from some
distribution $\mu$. The outcome of the n-th pairwise comparison of user m
is denoted by an ordered pair $w_{m,n} = (i, j)$ if user m compares items
i and j, and prefers i over j.

We first introduce the Mallows model (Mallows, 1957). In M4, let the k-th
Mallows component define a probability distribution on the set of all
permutations over the Q items. It is parameterized by a reference ranking
$\sigma_k$ and a dispersion parameter $\phi_k \in [0, 1]$:

$$p_M(\sigma \mid \sigma_k, \phi_k) = \phi_k^{d(\sigma, \sigma_k)} / Z_k \tag{1}$$

where $\sigma$ denotes an arbitrary permutation, $d(\sigma, \sigma_k)$
denotes the Kendall's tau distance between two permutations, and $Z_k$ is
the normalization constant. The generative process for the comparisons in
M4 from user m = 1, ..., M is,

1. Sample ranking weight $\theta_m \in \Delta^K$ from prior $\Pr(\theta)$.

2. For each comparison n = 1, ..., N,

(a) Sample a pair of items $\{i, j\}$ from $\mu$.

(b) Sample a ranking token $z \in \{1, \ldots, K\} \sim \text{Multinomial}(\theta_m)$.

(c) Sample a permutation $\sigma_{m,n}$ from the z-th Mallows component
with parameter $(\sigma_z, \phi_z)$.

(d) If $\sigma_{m,n}(i) < \sigma_{m,n}(j)$ then $w_{m,n} = (i, j)$,
otherwise $w_{m,n} = (j, i)$.¹
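The generative steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the Mallows draw uses the standard repeated-insertion construction, the item pair is drawn uniformly, and the prior is a symmetric Dirichlet. All three are assumptions made here for concreteness, and the function names are hypothetical.

```python
import random


def sample_mallows(reference, phi, rng):
    """Draw one permutation from Mallows(reference, phi) by repeated insertion.

    `reference` lists items from most to least preferred; phi = 0 always
    returns the reference ranking.
    """
    ranking = []
    for i, item in enumerate(reference):
        # Inserting at slot j (0 = top) puts the new item above i - j items
        # that the reference ranks higher, i.e. creates i - j inversions.
        weights = [phi ** (i - j) for j in range(i + 1)]
        slot = rng.choices(range(i + 1), weights=weights)[0]
        ranking.insert(slot, item)
    return ranking


def sample_dirichlet(alpha, k, rng):
    """Step 1 with an assumed symmetric Dirichlet(alpha) prior on the weights."""
    gammas = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(gammas)
    return [g / total for g in gammas]


def sample_user_comparisons(references, phis, theta, n_comparisons, rng):
    """Steps 2(a)-(d): return a list of ordered pairs (preferred, other)."""
    items = list(references[0])
    comparisons = []
    for _ in range(n_comparisons):
        i, j = rng.sample(items, 2)                               # (a) uniform pair (assumption)
        z = rng.choices(range(len(theta)), weights=theta)[0]      # (b) ranking token
        sigma = sample_mallows(references[z], phis[z], rng)       # (c) permutation
        if sigma.index(i) < sigma.index(j):                       # (d) earlier position wins
            comparisons.append((i, j))
        else:
            comparisons.append((j, i))
    return comparisons
```

A user would then be generated with `theta = sample_dirichlet(0.1, K, rng)` followed by `sample_user_comparisons(...)`; the repeated-insertion sampler avoids enumerating all Q! permutations.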

Figure 1 is the plate representation of M4. The mixing weights $\theta_m$
over the K shared Mallows components characterize each user. We denote by
the $W \times M$ matrix X the empirical observations. Its
$W = Q(Q - 1)$ rows are indexed by all the ordered pairs (i, j).
$X_{(i,j),m}$ denotes the number of times that user m prefers item i over
j. Given X and K, the primary problem in this paper is to learn the
parameters of the shared latent Mallows components.
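As a small illustration of the observation matrix, the sketch below builds X from per-user lists of ordered pairs. It assumes items are labelled 0, ..., Q-1 and that rows are ordered as produced by `itertools.permutations`; both choices, and the function name, are illustrative assumptions rather than part of the model.

```python
from itertools import permutations


def build_count_matrix(user_comparisons, n_items):
    """Return (X, row_index), where X[row][m] counts how often user m chose (i, j).

    `user_comparisons[m]` is the list of ordered pairs (i, j) meaning that
    user m preferred item i over item j.
    """
    ordered_pairs = list(permutations(range(n_items), 2))    # W = Q(Q-1) rows
    row_index = {pair: r for r, pair in enumerate(ordered_pairs)}
    n_users = len(user_comparisons)
    X = [[0] * n_users for _ in ordered_pairs]
    for m, comparisons in enumerate(user_comparisons):
        for pair in comparisons:
            X[row_index[pair]][m] += 1
    return X, row_index
```

Each column of X plays the role of a bag-of-words vector for one "document" (user).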

Reduction to Topic Modeling 

We show that the problem of learning model parameters in M4 can be
formally reduced to topic discovery in an equivalent topic model. To
establish the connection, we first consider the distribution on the
pairwise comparisons

$$p(w_{m,n} = (i, j) \mid \theta_m) = \mu_{i,j} \sum_{k=1}^{K} \sum_{\sigma:\, \sigma(i) < \sigma(j)} p_M(\sigma \mid \sigma_k, \phi_k)\, \theta_{k,m}$$

where $\mu_{i,j} = \mu_{j,i} > 0$ is the probability of comparing items i
and j. For further reference, we define the ranking matrix to be a
$W \times K$ dimension matrix $\beta$ whose entries are,

$$\beta_{(i,j),k} = \sum_{\sigma:\, \sigma(i) < \sigma(j)} p_M(\sigma \mid \sigma_k, \phi_k) \tag{2}$$

Statistically, $\beta_{(i,j),k}$ represents the probability that item i
is preferred over item j if the ranking is sampled from the k-th Mallows
component. The k-th column of $\beta$ therefore captures the pairwise
comparison behavior induced by the k-th Mallows distribution and is a
function determined only by $\sigma_k, \phi_k$. For convenience, we also
define a $W \times K$ matrix B as
$B_{(i,j),k} = \mu_{i,j}\, \beta_{(i,j),k}$. Therefore, the conditional
probability of the comparisons can be simplified as,

$$p(w_{m,n} = (i, j) \mid \theta_m) = \sum_{k=1}^{K} B_{(i,j),k}\, \theta_{k,m} \tag{3}$$

¹ $\sigma(i)$ is the position of item i in a ranking $\sigma$. Item i is
preferred over j if $\sigma(i) < \sigma(j)$.

Learning Mixed Membership Mallows Models from Pairwise Comparisons

Before we connect to topic modeling, we summarize the properties of the
ranking matrix $\beta$ that enable us to infer the Mallows parameters
directly from B:

Proposition 1. Let the ranking matrix $\beta$ be defined as in Eq. (2),
and $\sigma_k, \phi_k$ be the parameters of the K Mallows distributions.
Then, $\forall (i, j)$ and $\forall k$, we have,

a. $\beta_{(i,j),k} = \dfrac{B_{(i,j),k}}{B_{(i,j),k} + B_{(j,i),k}}$

b. If $\sigma_k(i) < \sigma_k(j)$ and $\phi_k < 1$, $\beta_{(i,j),k} > 0.5 > \beta_{(j,i),k}$

c. If $\sigma_k(j) = \sigma_k(i) + 1$ and $\phi_k < 1$, $1 / \beta_{(i,j),k} = 1 + \phi_k$

First, by Prop. 1 a., we can directly infer $\beta$ from B. Second, by
Prop. 1 b., one can infer the relative position of any two items in the
reference rankings $\sigma_1, \ldots, \sigma_K$ by comparing the entries
in $\beta$ with 1/2. Therefore, if the estimation error in $\beta$ is
element-wise small and $\phi_k < 1$, then all the pairwise relations in
the K reference rankings can be correctly inferred, and hence the total
rankings. Furthermore, the dispersion can be estimated using
Prop. 1 c.² In sum, we can learn all the model parameters from B. For the
rest of this paper, we focus on learning B.
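The properties in Prop. 1 can be checked numerically on a tiny example. The sketch below computes one column of the ranking matrix by brute-force enumeration of all permutations, which is only feasible for a handful of items; the helper names are hypothetical and the enumeration is purely illustrative.

```python
from itertools import permutations


def kendall_tau(sigma, reference):
    """Number of item pairs that sigma orders differently from the reference.

    Both arguments list items from most to least preferred.
    """
    pos = {item: r for r, item in enumerate(sigma)}
    n = len(reference)
    return sum(
        1
        for a in range(n)
        for b in range(a + 1, n)
        if pos[reference[a]] > pos[reference[b]]
    )


def ranking_matrix_column(reference, phi):
    """beta[(i, j)] = P(i preferred over j) under Mallows(reference, phi)."""
    perms = list(permutations(reference))
    weights = [phi ** kendall_tau(p, reference) for p in perms]
    z = sum(weights)                                   # normalization constant Z_k
    beta = {}
    for i in reference:
        for j in reference:
            if i != j:
                mass = sum(w for p, w in zip(perms, weights)
                           if p.index(i) < p.index(j))
                beta[(i, j)] = mass / z
    return beta
```

For a reference ranking such as `[2, 0, 3, 1]` with dispersion 0.3, the entry for the adjacent pair (2, 0) equals 1 / (1 + 0.3), and every pair ordered as in the reference has probability above 1/2, as Prop. 1 b. and c. state.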

We note that Eq. (3) shares the same structure as in probabilistic topic
modeling (Blei, 2012; Airoldi et al., 2014). We consider a topic model on
a set of M documents, each composed of N > 2 words that are drawn from a
vocabulary of size W, with a $W \times K$ dimension topic matrix
$\beta^{tm}$, and the document-specific topic weights $\theta^{tm}_m$
sampled independently from a topic prior $\Pr^{tm}(\theta)$. The
conditional distribution on $w^{tm}_{m,n}$, the n-th word in document m,
is

$$p(w^{tm}_{m,n} = i \mid \theta^{tm}_m) = \sum_{k=1}^{K} \beta^{tm}_{i,k}\, \theta^{tm}_{k,m} \tag{4}$$

where i = 1, ..., W are distinct words in the vocabulary. Noting that B
is also column-stochastic, we have,

Lemma 1. The proposed Mixed Membership Mallows Model is statistically
equivalent to a topic model whose topic matrix is set to be B and whose
topic prior is set to be $\Pr(\theta)$.

Proof. We consider the distribution on the observations in both models,
i.e., the distribution on the outcomes of pairwise comparisons
$\mathbf{w} = \{w_{m,n}\}$ in M4 and the words
$\mathbf{w}^{tm} = \{w^{tm}_{m,n}\}$ in the topic model. Noting that the
users are independent and that each user's comparisons are independent
conditioned on $\theta_m$, from the conditional probabilities in Eqs. (3)
and (4) we have,

$$
\begin{aligned}
p(\mathbf{w} \mid B) &= \prod_{m=1}^{M} \int p(w_{m,1}, \ldots, w_{m,N} \mid \theta_m, B)\, \Pr(\theta_m)\, d\theta_m \\
&= \prod_{m=1}^{M} \int \left( \prod_{n=1}^{N} \sum_{k=1}^{K} B_{w_{m,n},k}\, \theta_{k,m} \right) \Pr(\theta_m)\, d\theta_m \\
&= p(\mathbf{w}^{tm} \mid \beta^{tm}),
\end{aligned}
$$

which is the same as in topic models (Blei, 2012). □

² If $\phi_k = 1$, the k-th Mallows component is the uniform distribution
and is un-identifiable. We consider $\phi_k < 1$ in this paper.

Thus, the estimation problem in M4 can be solved by first learning B
using any topic modeling algorithm, and then estimating the parameters of
the shared Mallows components using Prop. 1. Before we discuss our
approach in detail in the next section, we consider the relation between
M4 and other ranking models. We highlight that the proposed M4 is a much
more general family that subsumes a few existing ranking models as
special cases:

Proposition 2. In the Mixed Membership Mallows Model,

1. If the dispersion parameters $\phi_k = 0$ for all k, then each Mallows
component has non-zero probability only on the reference ranking
$\sigma_k$, and the Mixed Membership Mallows Model reduces to the topic
modeling framework proposed in (Ding et al., 2015).

2. If the topic prior $\Pr(\theta)$ has non-zero probability only on the
vertices of the K-dimensional simplex, then each user can only be
influenced by one Mallows component and the Mixed Membership Mallows
Model reduces to the mixture of Mallows model (Lu & Boutilier, 2014;
Awasthi et al., 2014).

3. A Geometric Approach 

We discuss in this section the key geometric insights of our approach. We
leverage the recent works in separable topic discovery that come with
consistency and efficiency guarantees (Arora et al., 2013;
Ding et al., 2014b; Kumar et al., 2013; Bansal et al., 2014, etc.). The
consistency is favorable here since we are not enforcing the estimation
to be valid total rankings. To be precise, we exploit the geometric
property of the second-order moments of the columns of X, i.e., a
co-occurrence matrix of pairwise comparisons, which can be estimated
consistently:

Lemma 2. If $\widetilde{X}$ and $\widetilde{X}'$ are obtained from X by
first splitting each user's comparisons into two independent halves and
then re-scaling the rows to make them row-stochastic, then

$$M \widetilde{X}' \widetilde{X}^{\top} \xrightarrow[\text{almost surely}]{M \to \infty} \bar{B} \bar{R} \bar{B}^{\top} =: E, \tag{5}$$

where $\bar{B} = \mathrm{diag}^{-1}(Ba)\, B\, \mathrm{diag}(a)$,
$\bar{R} = \mathrm{diag}^{-1}(a)\, R\, \mathrm{diag}^{-1}(a)$, and $a$
and $R$ are, respectively, the $K \times 1$ expectation and
$K \times K$ correlation matrix of the weight vector $\theta_m$.

In this paper, we always assume that R (the $K \times K$ topic
co-occurrence matrix) has full rank, which is satisfied by many important
prior distributions (Arora et al., 2013).
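The estimate in Lemma 2 can be illustrated with a short pure-Python sketch. It assumes each user's comparisons are given as a list of ordered pairs and that `row_index` maps an ordered pair to its row; the random alternating split is one simple way to obtain two halves and is an assumption of this sketch, as are the function names.

```python
import random


def split_and_normalize(user_comparisons, row_index, rng):
    """Split every user's comparisons into two halves and return two
    W x M count matrices whose non-empty rows are rescaled to sum to one."""
    n_rows, n_users = len(row_index), len(user_comparisons)
    halves = [[[0.0] * n_users for _ in range(n_rows)] for _ in range(2)]
    for m, comparisons in enumerate(user_comparisons):
        shuffled = list(comparisons)
        rng.shuffle(shuffled)
        for n, pair in enumerate(shuffled):
            halves[n % 2][row_index[pair]][m] += 1.0    # alternate between halves
    for matrix in halves:
        for row in matrix:
            total = sum(row)
            if total > 0:                               # unobserved rows stay zero
                for m in range(n_users):
                    row[m] /= total
    return halves[0], halves[1]


def cooccurrence(x_first, x_second):
    """Return M * x_second @ x_first^T as a W x W list of lists."""
    n_users = len(x_first[0])
    return [
        [n_users * sum(rb[m] * ra[m] for m in range(n_users)) for ra in x_first]
        for rb in x_second
    ]
```

Using two disjoint halves keeps the sampling noise in the two factors independent, which is what allows the product to converge to the second-order moment in Eq. (5) even when each user contributes only a few comparisons.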


3.1. Approximate Separability 

The consistent separable topic discovery approaches (e.g.,
Arora et al., 2013; Ding et al., 2014b) require the ranking matrix
$\beta$ to be exactly separable, i.e., for each k, there exist some novel
rows (i.e., ordered pairs (i, j)) such that $\beta_{(i,j),k} > 0$ and
$\beta_{(i,j),l} = 0, \forall l \neq k$. If this exact separability
condition holds, the row vectors in E of the novel pairs will be extreme
points of the convex hull formed by all row vectors of E (the shaded dash
circles in Fig. 2).


By the definition of the ranking matrix in Eq. (2), for $\phi_k > 0$,
none of the entries in the ranking matrix $\beta$ is identically zero.
Hence exact separability cannot be satisfied. However, recall that
$\beta_{(i,j),k}$ is the probability of preferring item i over j in the
k-th Mallows component. By the property of the Mallows distribution,
$\beta_{(i,j),k}$ will be very close to 0 if the position of item j in
the reference ranking $\sigma_k$ is higher than that of i by a large
margin. Explicitly,


Proposition 3. Let $\sigma_k(i)$ and $\sigma_k(j)$ be the positions of
items i and j in the reference ranking $\sigma_k$ of the k-th Mallows
component and $\phi_k < 1$. If $\sigma_k(i) > \sigma_k(j)$ and
$L = \sigma_k(i) - \sigma_k(j) + 1$, then,

$$\beta_{(i,j),k} \le \frac{L \phi_k^{L-1}}{1 + L \phi_k^{L-1}} \tag{6}$$

Since $\phi_k < 1$, as L increases, the corresponding $\beta_{(i,j),k}$
becomes arbitrarily close to 0. Motivated by this observation in Prop. 3,
we propose to consider ranking matrices $\beta$ that are approximately
separable:

Definition 1. ($\lambda$-Approximate Separability) A $W \times K$
non-negative matrix $\beta$ is $\lambda$-approximately separable for some
constant $\lambda \in [0, 1)$, if $\forall k = 1, \ldots, K$ there exists
at least one row (i.e., ordered pair) (i, j) such that
$\beta_{(i,j),k} > 0$ and
$\beta_{(i,j),l} \le \lambda\, \beta_{(i,j),k}, \forall l \neq k$.


The $\lambda$-approximate separability requires the existence of ordered
pairs that have negligible probability in all-but-one Mallows components,
i.e., the row weights concentrate predominantly in one column (see
Fig. 2). We will refer to such pairs (rows of $\beta$) as
$\lambda$-approximate novel pairs (rows) for each latent factor. By
Prop. 3 for M4, the approximate separability boils down to the existence
of pairs of items $\{i, j\}$ such that i is uniquely preferred over j in
one reference ranking, while j is ranked higher than i by a large margin
in all other reference rankings.
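To make the definition concrete, the following sketch computes the smallest constant for which a given non-negative matrix satisfies the approximate separability condition, together with one best novel row per column. It assumes the matrix is a list of rows with at least two columns; the function name is hypothetical.

```python
def approx_separability(beta):
    """Return (lam, novel_rows) for a non-negative matrix given as a list of rows.

    lam is the smallest constant such that every column k has a row whose
    other entries are at most lam times its k-th entry; novel_rows[k] is the
    index of the row attaining the best ratio for column k.
    """
    n_cols = len(beta[0])
    lam, novel_rows = 0.0, []
    for k in range(n_cols):
        best_ratio, best_row = float("inf"), None
        for r, row in enumerate(beta):
            if row[k] <= 0:
                continue                       # the definition needs a positive entry
            ratio = max(row[l] for l in range(n_cols) if l != k) / row[k]
            if ratio < best_ratio:
                best_ratio, best_row = ratio, r
        lam = max(lam, best_ratio)
        novel_rows.append(best_row)
    return lam, novel_rows
```

A returned value below 1 means the matrix is approximately separable in the sense of Definition 1; a value of exactly 0 corresponds to exact separability.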

For small $\lambda$, this seems to be a very restrictive condition on the
shared latent Mallows distributions. However, as we show shortly in the
next section, most M4 models are approximately separable for a small
constant $\lambda > 0$ if the number of items Q scales sufficiently
faster than K. Therefore, only a negligible fraction of models in M4 do
not satisfy approximate separability.

3.2. Inevitability of the Approximate Separability 

We investigate the probability that approximate separability is satisfied
when we draw uniformly from M4. Specifically, we sample the K reference
rankings $\sigma_k$ uniformly i.i.d. from the set of all permutations,
and set $\phi_k \le \phi < 1, \forall k$. We have,

Lemma 3. Let the K reference rankings $\sigma_1, \ldots, \sigma_K$ be
sampled i.i.d. uniformly from the set of all permutations, and the
dispersion parameters $\phi_k \le \phi < 1$, k = 1, ..., K. Then, the
probability that the ranking matrix $\beta$ is $\lambda$-approximately
separable is at least

Wfexp( -I <7) 

where $L(\phi, \lambda) = \mathrm{ceil}\left( \left(1 + \frac{\log \lambda}{\log \phi}\right)(1 + \epsilon) \right)$
for some positive constant $\epsilon$, and ceil(x) is the minimum integer
that is no smaller than x.


Therefore, for $Q \gg K$, the ranking matrix $\beta$ is going to be
approximately separable with high probability. L is determined by
$\log(\lambda) / \log(\phi)$, and would be small even for very small
$\lambda$ because of the logarithmic dependence. The proof exploits the
property illustrated in Prop. 3 and is deferred to the supplementary
section. We note that the bound in Lemma 3 is only a loose upper bound on
the probability of non-separability.


We point out that, by definition, approximate separability of $\beta$ is
equivalent to that of B. Therefore B is also approximately separable with
high probability.

3.3. Robust Novel Pair Detection 


          Mallows        Mallows        Mallows
          component 1    component 2    component 3
Pair 1    0.98           0.01           0.01
Pair 2    0.01           0.99           0.01
Pair 3    0.01           0.01           0.90
Pair 4    0.98           0.90           0.10
Pair 5    0.10           0.09           0.90

                         $\beta$

Figure 2. An example of an approximately separable $\beta$ with K = 3,
and the underlying geometry of the row vectors of E. Pairs 1, 2, 3 are
approximate novel pairs for Mallows components 1, 2, and 3. The shaded
dash circles represent the ideal extreme points with exactly separable
$\beta$ and the shaded regions depict their solid angles. The numbers in
$\beta$ are from $\phi_k = 0.1$: $\beta_{(i,j),k} \approx 0.01$ when
L = 3, $\beta_{(i,j),k} \approx 0.1$ when L = 2, where
$L = \sigma_k(i) - \sigma_k(j) + 1$.

Recall that when $\beta$ is exactly separable, the novel rows in E are
extreme points (shaded dash circles in Fig. 2). If $\beta$ is
$\lambda$-approximately separable with small enough $\lambda$, the rows
of E can be viewed as a small perturbation from the ideal case. As a
result, the rows corresponding to the approximate novel pairs will be
inside the ideal convex hull and are close to the ideal extreme points
($E_{\text{pair 1}}$, $E_{\text{pair 2}}$, and $E_{\text{pair 3}}$ in
Fig. 2). On the other hand, the non-novel rows could become extreme
points but would be close to the convex hull formed by the approximate
novel rows (e.g., $E_{\text{pair 4}}$ in Fig. 2).

We detect the approximate novel pairs as the most "extreme" rows of E
based on a robust geometric measure, the normalized Solid Angle subtended
by extreme points (see Fig. 2) (Ding et al., 2014b). Statistically, it is
the probability that a row vector $E_{(i,j)}$ has the maximum projection
value along an isotropically distributed direction
$d \in \mathbb{R}^{W \times 1}$:

$$q_{(i,j)} = p\{\forall (s,t) : \|E_{(i,j)} - E_{(s,t)}\| \ge \zeta,\ E_{(i,j)} d > E_{(s,t)} d\} \tag{8}$$

When $\beta$ is exactly separable, $q_{(i,j)} = 0$ for non-novel pairs
and is strictly positive for novel pairs. When the deviation introduced
by $\lambda$-approximate separability is small, the solid angle for
approximate novel pairs will be close to that of the ideal extreme
points. For the non-novel pairs that become extreme points due to
$\lambda$-approximate separability ($E_{\text{pair 4}}$ in Fig. 2), the
associated solid angles will be close to 0 since they are very close to
the convex hull formed by the rows of the approximate novel pairs. In
summary, if we sort the solid angles for all rows in E, the ones with the
largest solid angles must correspond to $c\lambda$-approximate novel
pairs for some constant c and properly defined $\zeta$ in Eq. (8).

By definition in Eq. (8), the solid angles can be consistently
approximated using a few i.i.d. isotropic d's and an asymptotically
consistent estimate of E (Ding et al., 2014b). Once all the approximate
novel pairs for the K distinct Mallows components are identified, B and
therefore the model parameters can be estimated using constrained linear
regression (Arora et al., 2013; Ding et al., 2014b) and Prop. 1. Given
the estimated parameters of the ranking matrix $\beta$, we can infer the
user-specific preference weight $\theta_m$ and evaluate the prediction
probability of new comparisons using standard inference in topic modeling
(Blei, 2012).

4. Algorithm and Complexity Bounds 

The main steps of our approach are outlined in Alg. 1 and expanded in
detail in Alg. 2, 3 and 4. Alg. 2 detects all the approximate novel pairs
for the K distinct latent components. Alg. 3 estimates the matrix B using
constrained linear regression followed by row scaling. Alg. 4 further
infers the model parameters from B using Prop. 1. In particular, Step 2
of Alg. 4 estimates all the pairwise relations
$\mathbb{I}(\sigma_k(i) < \sigma_k(j))$ in the reference rankings (which
is an equivalent representation of a total ranking), and Step 4 estimates
$\phi_k$.


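Algorithm 1 M4 Estimation (Main Steps)

Input: Pairwise comparisons $\widetilde{X}, \widetilde{X}'$ ($W \times M$) (defined in
  Lemma 2); number of latent components K; number of projections P;
  tolerance parameters $\zeta, \epsilon > 0$
Output: Reference rankings $\hat{\sigma}_k$ and dispersions $\hat{\phi}_k$, k = 1, ..., K

1: Novel pairs $\mathcal{I} \leftarrow$ NovelPairDetect($\widetilde{X}, \widetilde{X}', K, P, \zeta$)
2: $\hat{B} \leftarrow$ EstimateRankingMatrix($\mathcal{I}, \widetilde{X}, \widetilde{X}', \epsilon$)
3: $\hat{\sigma}_1, \ldots, \hat{\sigma}_K, \hat{\phi}_1, \ldots, \hat{\phi}_K \leftarrow$ PostProcess($\hat{B}$)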


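Algorithm 2 NovelPairDetect (via Random Projections)

Input: $\widetilde{X}, \widetilde{X}'$; number of rankings K; number of projections P;
  tolerance $\zeta$
Output: $\mathcal{I}$: the set of all novel pairs of K distinct rankings.

  $\hat{E} \leftarrow M \widetilde{X}' \widetilde{X}^{\top}$
  $\forall (i,j)$: $J_{(i,j)} \leftarrow \{(s,t) : \|\hat{E}_{(i,j)} - \hat{E}_{(s,t)}\| \ge \zeta/2\}$
  for r = 1, ..., P do
    Sample $d_r \in \mathbb{R}^{W}$ from an isotropic prior
    $\hat{q}_{(i,j),r} \leftarrow \mathbb{I}\{\forall (s,t) \in J_{(i,j)} : \hat{E}_{(s,t)} d_r \le \hat{E}_{(i,j)} d_r\}$, $\forall (i,j)$
  end for
  $\hat{q}_{(i,j)} \leftarrow \frac{1}{P} \sum_{r=1}^{P} \hat{q}_{(i,j),r}$, $\forall (i,j)$
  $k \leftarrow 0$, $l \leftarrow 1$, and $\mathcal{I} \leftarrow \emptyset$
  while k < K do
    $(s,t) \leftarrow$ index of the l-th largest value among the $\hat{q}_{(i,j)}$'s
    if $(s,t) \in \bigcap_{(i,j) \in \mathcal{I}} J_{(i,j)}$ then
      $\mathcal{I} \leftarrow \mathcal{I} \cup \{(s,t)\}$, $k \leftarrow k + 1$
    end if
    $l \leftarrow l + 1$
  end while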
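The random-projection procedure of Algorithm 2 can be sketched in pure Python as follows. The sketch operates on an already estimated matrix given as a list of rows, uses standard Gaussian directions as the isotropic prior, and reads the distance test as a Euclidean distance of at least half the tolerance; these readings and the function name are assumptions of this illustration, not a transcription of the authors' code.

```python
import math
import random


def novel_pair_detect(E, K, n_projections, zeta, rng):
    """Pick K rows of E with the largest estimated solid angles.

    Returns (novel, q): the selected row indices and the estimated
    normalized solid angle of every row.
    """
    W = len(E)

    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # J[i]: rows far enough from row i to be treated as different points.
    J = [{s for s in range(W) if dist(E[i], E[s]) >= zeta / 2} for i in range(W)]
    hits = [0] * W
    for _ in range(n_projections):
        d = [rng.gauss(0.0, 1.0) for _ in range(len(E[0]))]   # isotropic direction
        proj = [sum(e * x for e, x in zip(row, d)) for row in E]
        for i in range(W):
            # Row i "wins" if no sufficiently distant row projects higher.
            if all(proj[s] <= proj[i] for s in J[i]):
                hits[i] += 1
    q = [h / n_projections for h in hits]
    novel = []
    for cand in sorted(range(W), key=lambda i: -q[i]):
        if len(novel) == K:
            break
        # Keep a candidate only if it is far from every row already chosen.
        if all(cand in J[i] for i in novel):
            novel.append(cand)
    return novel, q
```

Because only rows at least half the tolerance away compete with each other, near-duplicate rows of the same extreme point do not split its solid angle, and the greedy selection then avoids picking two rows that represent the same latent component.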


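Algorithm 3 Estimate Ranking Matrix

Input: $\mathcal{I} = \{(i_1, j_1), \ldots, (i_K, j_K)\}$, the set of novel pairs of K
  rankings; $\widetilde{X}, \widetilde{X}'$; precision $\epsilon$
Output: $\hat{B}$ as the estimate of B.

  $Y = (\widetilde{X}_{(i_1,j_1)}^{\top}, \ldots, \widetilde{X}_{(i_K,j_K)}^{\top})^{\top}$
  $Y' = (\widetilde{X}'^{\top}_{(i_1,j_1)}, \ldots, \widetilde{X}'^{\top}_{(i_K,j_K)})^{\top}$
  for all (i, j) pairs do
    Solve $\arg\min_{b}\ M (\widetilde{X}_{(i,j)} - bY)(\widetilde{X}'_{(i,j)} - bY')^{\top}$
      subject to $b_k \ge 0$, $\sum_{k=1}^{K} b_k = 1$, with precision $\epsilon$
  end for
  $\hat{B} \leftarrow$ column normalize $\hat{\beta}$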
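The regression step in Algorithm 3 is a least-squares problem constrained to the probability simplex. The sketch below shows one generic way to solve such a problem, projected gradient descent with a Euclidean simplex projection. It deliberately substitutes a plain squared-error objective for the cross-moment objective of the algorithm, so it illustrates the constraint handling only; the function names and the fixed iteration count are assumptions.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {b : b_k >= 0, sum_k b_k = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        t = (cumulative - 1.0) / k
        if u_k - t > 0:          # the last k satisfying this gives the threshold
            theta = t
    return [max(x - theta, 0.0) for x in v]


def simplex_regression(x, Y, n_iter=2000):
    """Minimize ||x - b Y||^2 over the simplex by projected gradient descent.

    x is a list of length D; Y is a list of K rows of length D.
    """
    K = len(Y)
    gram = [[sum(a * c for a, c in zip(Y[k], Y[l])) for l in range(K)]
            for k in range(K)]
    xy = [sum(a * c for a, c in zip(x, Y[k])) for k in range(K)]
    # The gradient is 2 (gram b - xy); 2 * trace(gram) bounds its Lipschitz constant.
    step = 1.0 / (2.0 * sum(gram[k][k] for k in range(K)))
    b = [1.0 / K] * K
    for _ in range(n_iter):
        grad = [2.0 * (sum(b[l] * gram[l][k] for l in range(K)) - xy[k])
                for k in range(K)]
        b = project_to_simplex([b[k] - step * grad[k] for k in range(K)])
    return b
```

When a row is an exact convex combination of the novel rows, the recovered weights are the combination coefficients, which is the geometric fact the regression step relies on.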
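Algorithm 4 Post Processing

Input: $\hat{B}$ as the estimate of B
Output: $\hat{\sigma}_k, \hat{\phi}_k$, k = 1, ..., K

1: $\hat{\beta}_{(i,j),k} \leftarrow \dfrac{\hat{B}_{(i,j),k}}{\hat{B}_{(i,j),k} + \hat{B}_{(j,i),k}}$, $\forall i, j \in U, \forall k$
2: $\hat{\sigma}_{(i,j),k} \leftarrow \mathrm{Round}[\hat{\beta}_{(i,j),k}]$, $\forall i, j \in U, \forall k$
3: $\hat{\sigma}_k \leftarrow \mathrm{GlobalRank}(\hat{\sigma}_{(i,j),k}, \forall i, j)$, $\forall k$ (First count the
   number of times each item wins in all pairwise comparisons and then sort.)
4: $\hat{\phi}_k \leftarrow \dfrac{1}{Q-1} \sum_{i=1}^{Q-1} \left( \dfrac{1}{\hat{\beta}_{(\hat{\sigma}_k^{-1}(i),\, \hat{\sigma}_k^{-1}(i+1)),k}} - 1 \right)$, $\forall k$
   ($\hat{\sigma}_k^{-1}(i)$ is the item in the i-th position in ranking $\hat{\sigma}_k$.)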
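The post-processing steps can be sketched for a single column of the estimated matrix as follows. The sketch assumes the column is stored as a dictionary keyed by ordered pairs and averages the adjacent-pair estimates of the dispersion, following Prop. 1; the function name and data layout are illustrative assumptions.

```python
def post_process(B_col, items):
    """Recover (reference ranking, dispersion) from one column of B.

    B_col maps every ordered pair (i, j) of distinct items to a
    non-negative number; `items` lists the item labels.
    """
    # Step 1: remove the pair-sampling weights (Prop. 1 a.).
    beta = {(i, j): B_col[(i, j)] / (B_col[(i, j)] + B_col[(j, i)])
            for (i, j) in B_col}
    # Steps 2-3: i beats j when beta > 1/2; count wins and sort by them.
    wins = {i: sum(1 for j in items if j != i and beta[(i, j)] > 0.5)
            for i in items}
    ranking = sorted(items, key=lambda i: -wins[i])
    # Step 4: for adjacent items, 1 / beta = 1 + phi (Prop. 1 c.); average.
    ratios = [1.0 / beta[(a, b)] - 1.0 for a, b in zip(ranking, ranking[1:])]
    phi = sum(ratios) / len(ratios)
    return ranking, phi
```

Counting wins rather than requiring a transitive set of pairwise relations means the procedure still returns a ranking when a few noisy entries fall on the wrong side of 1/2.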


Our approach has an overall polynomial computational complexity in all
model parameters.

Theorem 1. The running time of Algorithm 1 is $O(MNK + Q^2 K^3)$.

The proofs are in the supplementary material. We note that the term
$Q^2 K^3$ is a loose upper bound for the linear regression in Alg. 3. We
also derive sample complexity bounds for Alg. 1, which are also
polynomial in all model parameters and $\log(1/\delta)$, where $\delta$
is the error probability. Formally,

Theorem 2. Let the ranking matrix $\beta$ be $\lambda$-approximately
separable and the second-order moments R of the ranking prior be full
rank. If

$$\lambda \le \frac{a_{\min}\, \kappa\, (1 - \phi)\, q_{\wedge}}{8 K^2 a_0 \sqrt{\log(W / q_{\wedge})}}$$


and $M, P \to \infty$, then Algorithm 1 can consistently recover all the
reference rankings of the latent Mallows distributions. Moreover,
$\forall \delta > 0$, if


$$M \ge \max\left\{ \frac{640\, W^2 \log(3W/\delta)}{N \eta^4 d^2 q_{\wedge}^2},\ \frac{320\, W \log(3W/\delta)}{N \eta^4 \lambda_{\min}^2 a_{\min}^2 (1 - \phi)^2} \right\}$$


and for

$$P \ge \frac{32 \log(3W/\delta)}{q_{\wedge}^2}$$

the proposed algorithm fails with probability at most $\delta$. The other
model parameters are defined as follows:
$\eta = \min_{1 \le w \le W} [Ba]_w$; $a_{\max}$, $a_{\min}$ are the
max/min of the entries of a; $a_0 = \max_{i,j} a_i / a_j$;
$Y = \bar{R} \bar{B}^{\top}$; $\kappa = \lambda_{\min} / \lambda_{\max}$
is the condition number of R; $q_{\wedge}$ is the minimum normalized
solid angle formed by the row vectors of Y; $d = 6\kappa / K$;
$\phi_k \le \phi < 1$. N is the number of comparisons of each user.

The detailed proofs are summarized in the supplementary file. The bound
on $\lambda$ in Theorem 2 provides an explicit sufficient upper bound on
the required degree of $\lambda$-approximate separability. It is roughly
inverse polynomial in K. By Prop. 3, the margin L required to satisfy
this bound on $\lambda$ should scale as $O(\log K)$, which is small.


We note that in the complexity bounds, the term $1 - \phi$ represents the
spread of the Mallows components and determines the hardness of
estimation: for smaller $\phi$, $\lambda$ can be larger and the required
M is smaller. When $\phi \to 1$, the bounds in Theorem 2 reduce to
$\lambda = 0$ and $M \to \infty$, which is not achievable, and the
corresponding Mallows distribution is un-identifiable.

5. Experimental validation

We conduct experiments to validate the performance of our proposed
approach when the M4 assumptions are satisfied on a semi-synthetic
dataset, and then demonstrate that the proposed M4 can indeed effectively
capture the preference behavior in real-world datasets. In all
experiments, we used the settings suggested by (Ding et al., 2014b).
Specifically, the number of random projections is $P = 150 \times K$, the
tolerance is $\zeta = 0.05$ in Alg. 2 and $\epsilon = 10^{-4}$ in Alg. 3.

5.1. Semi-synthetic Simulation 

We generate synthetic examples according to the proposed M4 and evaluate
the proposed algorithm using the reconstruction error measured by the
Kendall's tau distance between the estimated reference rankings and the
ground truth. Since our estimation is up to a column permutation, we
align the estimated reference rankings using bipartite matching based on
the Kendall's tau distance.
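The evaluation metric can be sketched as follows. The matching here is done by brute force over all K! assignments, which is only practical for small K; an assignment solver such as `scipy.optimize.linear_sum_assignment` could replace it for larger K. The normalization by Q(Q-1) and the averaging over K follow the description of the experiments; the function names are hypothetical.

```python
from itertools import permutations


def kendall_tau(r1, r2):
    """Number of item pairs ordered differently by two rankings (best item first)."""
    pos = {item: p for p, item in enumerate(r2)}
    seq = [pos[item] for item in r1]
    return sum(
        1
        for a in range(len(seq))
        for b in range(a + 1, len(seq))
        if seq[a] > seq[b]
    )


def align_and_score(estimated, truth):
    """Match estimated rankings to ground-truth rankings and score them.

    Returns (assignment, error), where assignment[k] is the index of the
    ground-truth ranking matched to estimated[k] and error is the matched
    Kendall's tau distance normalized by Q(Q-1) and averaged over K.
    """
    K = len(truth)
    Q = len(truth[0])
    cost = [[kendall_tau(e, t) for t in truth] for e in estimated]
    best = min(
        permutations(range(K)),
        key=lambda perm: sum(cost[k][perm[k]] for k in range(K)),
    )
    total = sum(cost[k][best[k]] for k in range(K))
    return best, total / (K * Q * (Q - 1))
```

Matching before scoring matters because the latent components are only identifiable up to relabeling, so an unaligned comparison would penalize a correct estimate whose columns are merely permuted.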



Figure 3. The normalized Kendall's tau distance of the estimated
reference rankings, as functions of M, from the semi-synthetic dataset
with Q = 100, N = 300, K = 10 and different $\phi$.

The ground-truth reference rankings are obtained from a
real-world movie rating dataset, Movielens, using the same
approach as in (Ding et al., 2015) over $Q = 100$ items and
$K = 10$. We set the same dispersion parameter for all
Mallows components, $\phi_k = \phi$ for $\phi = 0, 0.1, 0.2, 0.5$. We
use a symmetric Dirichlet prior with concentration $\alpha_0 = 0.1$
to generate the ranking weights $\theta_m$'s. $N = 300$. $\mu_{i,j} =
1/\binom{Q}{2}, \forall i,j$. The error is further normalized by $W =
Q(Q - 1)$ and averaged across the $K$ reference rankings.
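The weight-generation step can be illustrated with a short sketch. It assumes that "symmetric Dirichlet with concentration 0.1" means every coordinate has parameter 0.1, and that each of a user's $N$ comparisons is assigned its own latent Mallows component, in analogy with words in a topic model; the function names are hypothetical.

```python
import random

def sample_dirichlet(K, alpha0, rng):
    """Draw one weight vector from a symmetric Dirichlet(alpha0, ..., alpha0)
    by normalizing independent Gamma(alpha0, 1) variates."""
    while True:
        g = [rng.gammavariate(alpha0, 1.0) for _ in range(K)]
        total = sum(g)
        if total > 0:  # guard against numerical underflow for small alpha0
            return [x / total for x in g]

def sample_component_labels(theta, N, rng):
    """Assumption: one latent component index per comparison of a user."""
    return rng.choices(range(len(theta)), weights=theta, k=N)

rng = random.Random(0)
theta_m = sample_dirichlet(K=10, alpha0=0.1, rng=rng)
labels = sample_component_labels(theta_m, N=300, rng=rng)
```

A small concentration such as 0.1 tends to put most of a user's weight on a few components, so the sampled labels are typically dominated by one or two indices.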

Fig. 3 depicts how the estimation error varies with the
number of users $M$ for different values of the dispersion. We can
see that the reconstruction error in the reference rankings for
$\phi = 0, 0.1, 0.2$ converges to zero at different rates as a
function of $M$. For M4 with $\phi = 0.5$, it converges to a small
but non-zero number as $M \to \infty$. We note that the
ground-truth ranking matrix $\beta$ is $\lambda = 0, 0.01, 0.05, 0.20$
approximately separable for $\phi = 0, 0.1, 0.2, 0.5$ respectively.
Our approach can therefore correctly detect the reference
rankings when $\lambda$ is small. When $\lambda$ is mild, it can still
detect most of the reference rankings correctly. 3

5.2. Comparison prediction - Movielens 

We consider in this section the prediction of pairwise
comparisons in a benchmark real-world dataset, Movielens. 4 The
star-rating dataset is selected due to its public availability and
widespread use, but we convert it to pairwise comparisons
and focus on modeling from the partial ranking viewpoint,
as suggested in the ranking literature (Lu & Boutilier,
2014; Volkovs & Zemel, 2014; Ding et al., 2015).

We focus on the $Q = 200$ most frequently rated movies in
Movielens, use the first $M = 4000$ users for training,
and use the remaining users for testing (Lu & Boutilier,
2014). We convert the training and test ratings into
comparisons independently: for every pair of movies $i, j$ that
user $m$ rated, a comparison $w_{m,n} = (i,j)$ is added if the star
rating for $i$ is higher than that for $j$, and all ties are ignored.
The prior is set to be Dirichlet and is estimated using the
methods in (Arora et al., 2013) given the estimated $\beta$.
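The conversion from star ratings to pairwise comparisons described above can be sketched as below. The function name and the dictionary representation of one user's ratings are assumptions made for illustration.

```python
from itertools import combinations

def ratings_to_comparisons(ratings):
    """Convert one user's star ratings into pairwise comparisons.

    ratings: dict mapping movie id -> stars.
    Returns a list of ordered pairs (i, j), meaning the user rated
    movie i strictly higher than movie j. Ties are ignored.
    """
    comparisons = []
    for i, j in combinations(sorted(ratings), 2):
        if ratings[i] > ratings[j]:
            comparisons.append((i, j))
        elif ratings[j] > ratings[i]:
            comparisons.append((j, i))
    return comparisons

user_ratings = {"A": 4, "B": 2, "C": 5, "E": 4}
comparisons = ratings_to_comparisons(user_ratings)
# A and E are tied at 4 stars, so that pair contributes no comparison.
```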

We evaluate the performance by the held-out
log-likelihood, i.e., $\Pr(\mathbf{w}_{\text{test}} \mid \beta)$. The log-likelihoods are
calculated using the standard Gibbs sampling approximation
in (Wallach et al., 2009) and are then normalized by the
total number of comparisons in the testing phase. We
compared our new model (M4) against the topic-modeling-based
model in (Ding et al., 2015) (TM) with the settings closest to
our model. We summarize the predictive probability for
different $K$ in Fig. 4. One can see that M4 improves on the
prediction accuracy of TM for the different choices of $K$
and can better fit the real-world observations.



Figure 4. The normalized predictive log-likelihood for various $K$
on the truncated Movielens dataset (horizontal axis: number of
latent factors $K$, with ticks at 5, 10, 15, 20).

5.3. Rating prediction via ranking model - Movielens 

To further demonstrate that our model can capture
real-world user behavior, we consider the standard rating
prediction task in recommendation systems (Toscher et al.,
2009). We first train M4 using the training comparisons,

3 For a random $\beta$ with $Q = 100$, $K = 10$, it is 0.05-approximate
separable with probability .933, .870, .793, .426 for
$\phi = 0, 0.1, 0.2, 0.5$ in 1000 Monte Carlo runs.

4 http://grouplens.org/datasets/movielens/ 


Table 2. Testing RMSE on the Movielens dataset

K    PMF      BPMF     BPMF-int  TM       M4
10   1.0491   0.8254   0.8723    0.8840   0.8509
15   0.9127   0.8236   0.8734    0.8780   0.8296
20   0.9250   0.8213   0.8678    0.8721   0.8241


and then predict ratings by aggregating the prediction of 
properly defined test comparisons. The purpose of this ex¬ 
periment is not to optimize to achieve the best empirical 
result in the rich literature on rating prediction. 

We use the same training/testing rating split as
(Salakhutdinov & Mnih, 2008a) and focus on the $Q = 100$
most rated movies in Movielens, following (Ding et al.,
2015). We convert the training ratings into training
comparisons (for each user, all pairs of movies she rated in
the training set are converted into comparisons based on
the stars, and ties are ignored) and train an M4 model.
The ranking prior is set to be Dirichlet. To predict the star
rating $r_{i,m}$ of user $m$ for movie $i$, we consider the
following method: for $s = 1, \dots, 5$, we set $r_{i,m} = s$ and
compare it against the movies user $m$ has rated in the
training set. This generates a set of pairwise comparisons
$\mathbf{w}_{i,m}(s)$. For example, suppose user $m$ has rated movies A, B, C
with 4, 2, 5 stars respectively in the training set and we
are predicting her rating for movie D. Then for $s = 3$,
$\mathbf{w}_{D,m}(3) = \{(A, D), (D, B), (C, D)\}$. We choose $s$ such
that,

$$\hat{r}_{i,m} = \arg\max_{s} \Pr(\mathbf{w}_{i,m}(s) \mid \mathbf{w}_{\text{train}}, \beta).$$
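The construction of the candidate comparison sets can be sketched as follows. The function names are hypothetical, and `score` is a placeholder for the model probability of a comparison set given the training data, whose computation is not shown here.

```python
def candidate_comparisons(train_ratings, movie, s):
    """Comparisons implied by hypothetically giving `movie` s stars.

    train_ratings: dict movie id -> stars for one user (training set).
    Returns a set of ordered pairs (i, j), meaning i is preferred over j.
    Ties produce no comparison.
    """
    comparisons = set()
    for other, stars in train_ratings.items():
        if stars > s:
            comparisons.add((other, movie))
        elif s > stars:
            comparisons.add((movie, other))
    return comparisons

def predict_rating(train_ratings, movie, score, stars=range(1, 6)):
    """Pick the star value whose implied comparisons score highest.
    Assumption: `score` is a user-supplied callable standing in for the
    model likelihood of a comparison set."""
    return max(
        stars,
        key=lambda s: score(candidate_comparisons(train_ratings, movie, s)),
    )

train = {"A": 4, "B": 2, "C": 5}
w_D_3 = candidate_comparisons(train, "D", 3)
# Reproduces the example in the text: {(A, D), (D, B), (C, D)}
```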

We evaluate the performance using the standard
root-mean-square-error (RMSE) metric (Toscher et al., 2009). We
compared our approach, M4, against the topic-modeling-based
method in (Ding et al., 2015) (TM), and two
benchmark rating-based algorithms, Probabilistic Matrix
Factorization (PMF) in (Salakhutdinov & Mnih, 2008b)
and Bayesian Probabilistic Matrix Factorization (BPMF) in
(Salakhutdinov & Mnih, 2008a), which have robust empirical
performance 5. Both PMF and BPMF are latent factor
models, and the number of latent factors $K$ has a similar
interpretation as in M4. Since the ratings predicted by our
algorithm are integers from 1 to 5, we also round the output
of BPMF to the nearest integers from 1 to 5 (BPMF-int).
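For concreteness, the metric and the rounding used for the BPMF-int baseline can be sketched as below. The function names are hypothetical, and rounding halves upward is an assumption, since the text does not specify a tie-breaking rule.

```python
import math

def round_to_star(x, lo=1, hi=5):
    """Round a real-valued prediction to the nearest integer star value,
    clipped to [lo, hi]. Assumption: halves are rounded up."""
    r = int(math.floor(x + 0.5))
    return max(lo, min(hi, r))

def rmse(predicted, actual):
    """Root-mean-square error between two equal-length rating lists."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))

raw = [3.4, 4.6, 0.2]
rounded = [round_to_star(x) for x in raw]
```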

We report the RMSE for different choices of $K$ in
Table 2. It is clear that M4 improves upon the
ranking-based TM, in which the latent factors are restricted to single
permutations. On the other hand, when compared to the
rating-based algorithms, the RMSE of our M4 approach
is close to that of BPMF and is lower than those of BPMF-int
and PMF, although they come from a different feature space. We
note that BPMF typically provides robust benchmark
results on real-world problems. This demonstrates

5 We use the suggested settings to optimize the
hyper-parameters and use the implementation and data split from
http://www.cs.toronto.edu/~rsalakhu/BPMF.html
that our approach can accommodate noisy real-world user
behavior.
A. Proof for Proposition 1 in the main paper

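We first consider the property of the ranking matrix $\beta$ for
M4 as summarized in Proposition 1 in the main paper. Recall
that the ranking matrix $\beta$ in M4 is defined as

$$\beta_{(i,j),k} := \mu_{i,j} \sum_{\sigma:\, \sigma(i) < \sigma(j)} p(\sigma \mid \sigma_k, \phi_k) \qquad (10)$$

Proposition 1 (in the main paper) Let the ranking matrix
$\beta$ be defined as in Eq. (10), where $\sigma_k$, $\phi_k$'s are the parameters of
the $K$ Mallows distributions, and let $\bar{\beta}_{(i,j),k}$ denote the
probability that $i$ is ranked before $j$ under the $k$-th component.
Then, $\forall (i,j)$ and $\forall k$, we have,

a. $\bar{\beta}_{(i,j),k} = \dfrac{\beta_{(i,j),k}}{\beta_{(i,j),k} + \beta_{(j,i),k}}$

b. If $\sigma_k(i) < \sigma_k(j)$ and $\phi_k < 1$, $\bar{\beta}_{(i,j),k} > 0.5 > \bar{\beta}_{(j,i),k}$

c. If $\sigma_k(j) = \sigma_k(i) + 1$ and $\phi_k < 1$, $1/\bar{\beta}_{(i,j),k} = 1 + \phi_k$

Proof. For a),

$$\frac{\beta_{(i,j),k}}{\beta_{(i,j),k} + \beta_{(j,i),k}} = \frac{\mu_{i,j}\bar{\beta}_{(i,j),k}}{\mu_{i,j}\bar{\beta}_{(i,j),k} + \mu_{j,i}\bar{\beta}_{(j,i),k}} = \bar{\beta}_{(i,j),k}$$

since $\mu_{i,j} = \mu_{j,i}$ and $\bar{\beta}_{(i,j),k} + \bar{\beta}_{(j,i),k} = 1$. The proof of
b), c) can be derived from the proof for Proposition 3 in the
main paper (see next section). □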


B. Proof for Proposition 3 and Lemma 3 in 
the main paper 

We first prove Proposition 3 in the main paper.

Proposition 3 (in the main paper) Let $\sigma_k(i)$ and $\sigma_k(j)$ be
the positions of items $i$ and $j$ in the reference ranking $\sigma_k$ of
the $k$-th Mallows component, with $\phi_k < 1$. If $\sigma_k(i) < \sigma_k(j)$
and $L = \sigma_k(j) - \sigma_k(i) + 1$, then,

$$\frac{\beta_{(i,j),k}}{\beta_{(j,i),k}} \ge \frac{1}{L\,\phi_k^{L-1}} \qquad (11)$$


Proof. Due to the symmetry in the ranking space, we
consider $\sigma_k(i) = i$, hence $\sigma_k : 1 \succ 2 \succ \cdots \succ Q$, where $\succ$
indicates “prefer over”. Instead of directly calculating the
summation in the definition,

$$\sum_{\sigma:\, \sigma(i) < \sigma(j)} p(\sigma \mid \sigma_k, \phi_k),$$

we consider the Repeated Insertion Model (RIM)
procedure. RIM is a generative procedure for sampling a ranking
which is equivalent to sampling a ranking from a Mallows
component. Specifically, in RIM, a ranking $\sigma$ is obtained
by sequentially placing the $i$-th item in the reference
permutation ($\sigma_k$) into the $j_i$-th position (of the current partial
sequence of length $i$), $1 \le j_i \le i$, in a probabilistic fashion:

$$p_i(j_i = l) = \frac{\phi_k^{\,i-l}}{1 + \phi_k^{1} + \cdots + \phi_k^{\,i-1}}$$

and $l \le i$, $1 \le i \le Q$.
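The insertion procedure above can be sketched in a few lines. This is an illustrative sampler under the stated insertion probabilities, not the authors' implementation; the function name `rim_sample` and the list-based representation (most preferred item first) are assumptions.

```python
import random

def rim_sample(reference, phi, rng=random):
    """Sample a ranking by repeated insertion.

    reference: list of items in reference order, most preferred first.
    phi: dispersion parameter in [0, 1].
    The i-th reference item is inserted at position l (1-based, l <= i)
    with probability proportional to phi ** (i - l).
    """
    ranking = []
    for i, item in enumerate(reference, start=1):
        positions = range(1, i + 1)
        weights = [phi ** (i - l) for l in positions]
        l = rng.choices(positions, weights=weights)[0]
        ranking.insert(l - 1, item)
    return ranking

rng = random.Random(1)
sample = rim_sample(list(range(1, 7)), 0.5, rng)
```

With `phi = 0` every item is appended at the end, so the reference ranking is returned unchanged; with `phi = 1` all insertion positions are equally likely and the result is a uniformly random permutation.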

Let $i < j$. By definition $\bar{\beta}_{(i,j),k}$ is the probability that item
$j$ is inserted after item $i$ in the RIM procedure. According
to the procedure of RIM, this probability is irrelevant to
the items after $j$ and, by symmetry, it is irrelevant to the
items before $i$. Without loss of generality, we set $i = 1$ and
consider $1 < j \le Q$. For simplicity, we denote $\phi_k = \phi < 1$.

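We first consider $q_{r,s}$, the probability of item 1 being in the
$r$-th position of the sequence after inserting the $s$-th item,
$1 \le s < j$ and $1 \le r \le s$. By induction, we shall show
that $q_{r,s} = \frac{\phi^{r-1}}{1 + \phi^{1} + \cdots + \phi^{s-1}}$.

As an initial point, after inserting the second item, when
$s = 2$, $q_{1,s} = \frac{1}{1+\phi}$ and $q_{2,s} = \frac{\phi}{1+\phi}$. Assume that the
hypothesis holds true for all $s' = 1, \dots, s$; then for $s + 1$
and $1 < r < s + 1$ (i.e., after inserting the item $s + 1$),

$$q_{r,s+1} = q_{r,s} \Pr(j_{s+1} > r) + q_{r-1,s} \Pr(j_{s+1} < r)$$

where $j_{s+1}$ is the position of item $s + 1$ after inserting it
into the partial sequence. By the induction assumption,

$$q_{r,s} = \frac{\phi^{r-1}}{1 + \phi^{1} + \cdots + \phi^{s-1}}, \qquad q_{r-1,s} = \frac{\phi^{r-2}}{1 + \phi^{1} + \cdots + \phi^{s-1}}$$

Therefore,

$$\begin{aligned}
q_{r,s+1} &= \frac{\phi^{r-1}}{1 + \phi^{1} + \cdots + \phi^{s-1}} \Pr(j_{s+1} > r) + \frac{\phi^{r-2}}{1 + \phi^{1} + \cdots + \phi^{s-1}} \Pr(j_{s+1} < r) \\
&= \frac{\phi^{r-1}}{1 + \phi^{1} + \cdots + \phi^{s-1}} \cdot \frac{1 + \phi + \cdots + \phi^{s-r}}{1 + \phi + \cdots + \phi^{s}} + \frac{\phi^{r-2}}{1 + \phi^{1} + \cdots + \phi^{s-1}} \cdot \frac{\phi^{s-r+2} + \cdots + \phi^{s}}{1 + \phi + \cdots + \phi^{s}} \\
&= \frac{\phi^{r-1}}{1 + \phi^{1} + \cdots + \phi^{s}}
\end{aligned}$$

Similarly, it is true for $r = 1$ and $r = s + 1$. This concludes the
induction hypothesis that $q_{r,s} = \frac{\phi^{r-1}}{1 + \phi^{1} + \cdots + \phi^{s-1}}$.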
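Now we can calculate $\bar{\beta}_{(1,j),k}$:

$$\begin{aligned}
\bar{\beta}_{(1,j),k} &= \sum_{r=1}^{j-1} q_{r,j-1} \Pr(j_j > r) \\
&= \frac{\sum_{r=1}^{j-1} \sum_{l=r-1}^{j-2} \phi^{l}}{(1 + \cdots + \phi^{j-2})(1 + \cdots + \phi^{j-1})} \\
&= \frac{1 - j\phi^{j-1} + (j-1)\phi^{j}}{(1-\phi)^2 (1 + \cdots + \phi^{j-2})(1 + \cdots + \phi^{j-1})}
\end{aligned}$$

Similarly, we have,

$$\bar{\beta}_{(j,1),k} = \frac{\phi^{j-1}\left(j - 1 - j\phi + \phi^{j}\right)}{(1-\phi)^2 (1 + \cdots + \phi^{j-2})(1 + \cdots + \phi^{j-1})}$$

Therefore,

$$\frac{\bar{\beta}_{(1,j),k}}{\bar{\beta}_{(j,1),k}} = \frac{1 - j\phi^{j-1} + (j-1)\phi^{j}}{\phi^{j-1}\left(j - 1 - j\phi + \phi^{j}\right)} \ge \frac{1}{j\phi^{j-1}}$$

and this concludes our proof.

We note that in the above equation, if we set $j = 2$, we get
$\bar{\beta}_{(1,2),k} / \bar{\beta}_{(2,1),k} = 1/\phi$. This proves Proposition 1 c. We also note
that $\bar{\beta}_{(1,j),k} / \bar{\beta}_{(j,1),k} > 1$, so $\bar{\beta}_{(i,j),k} > 0.5 > \bar{\beta}_{(j,i),k}$. This proves
Proposition 1 b. □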
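The closed-form expression obtained in the proof can be checked numerically on small instances. The sketch below is an illustration with hypothetical function names: it compares the closed form for the probability that item 1 precedes item $j$ against a brute-force sum over all permutations weighted by $\phi$ raised to the Kendall's tau distance from the reference ranking $1 \succ 2 \succ \cdots \succ Q$, using exact rational arithmetic.

```python
from fractions import Fraction
from itertools import permutations

def inversions(perm):
    """Kendall's tau distance between perm and the identity ordering."""
    n = len(perm)
    return sum(1 for a in range(n) for b in range(a + 1, n) if perm[a] > perm[b])

def marginal_bruteforce(Q, j, phi):
    """Probability that item 1 is ranked before item j under a Mallows
    model with reference ranking 1, 2, ..., Q, by full enumeration."""
    num = Fraction(0)
    den = Fraction(0)
    for perm in permutations(range(1, Q + 1)):
        w = phi ** inversions(perm)
        den += w
        if perm.index(1) < perm.index(j):
            num += w
    return num / den

def marginal_closed_form(j, phi):
    """Closed form from the proof; valid for phi < 1."""
    geo = lambda n: sum(phi ** t for t in range(n))  # 1 + phi + ... + phi^(n-1)
    numer = 1 - j * phi ** (j - 1) + (j - 1) * phi ** j
    return numer / ((1 - phi) ** 2 * geo(j - 1) * geo(j))

phi = Fraction(1, 2)
p = marginal_closed_form(2, phi)  # adjacent items: 1 / (1 + phi)
```

For adjacent items the value is $1/(1+\phi)$, in agreement with Proposition 1 c.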

Now, we consider the Lemma 3 in the main paper that
shows the inevitability of the approximate separability of
a random M4.

Lemma 2 (in the main paper) Let the $K$ reference
rankings $\sigma_1, \dots, \sigma_K$ be sampled i.i.d. uniformly from the set of
all permutations, and the dispersion parameters $\phi_k \le \phi <
1$, $k = 1, \dots, K$. Then, the probability that the ranking
matrix $\beta$ is $\lambda$-approximately separable is at least

(12) 

where Lft, A) = ceil ^(1 + + e)^j for some pos¬ 

itive constant e, and ceil (a;) is the minimum integer that is 
no smaller than x. 

Proof. Note that by Proposition 3 in the main paper, if $i$ is
preferred over $j$ in $\sigma_1$ and under $j$ in the other central
permutations, and the distance between their positions is $L$, then the
corresponding row is at most an $L\phi^{L-1}$-approximately novel
row for the first topic. The same holds for all the topics.

We note that if we consider two groups of disjoint items,
then the relative rankings within each group are independent
of the other group if the ranking is sampled uniformly from
all the permutations. In general, we divide the $Q$ items into
$Q/L$ groups of disjoint items, each containing $L$ items,
denoted by $i_{t,1}, \dots, i_{t,L}$ for $t = 1, \dots, Q/L$. If a
center permutation $\sigma_k$ is sampled uniformly at random from the
set of all permutations, then all the partial rankings within
each group $t$ are independent of those of another group $s$.

We now consider, for each of these $L$-tuples, the probability
that there exist two items $i, j$ such that $i$ is first and $j$ is
last in the group for the first central permutation $\sigma_1$, and in the
opposite way for the other permutations. We denote this
probability by $p_1(\phi, \lambda, k)$. By definition, we have,

$$p_1(\phi, \lambda, k) \ge \Pr\{\exists i, j \in \{i_{t,1}, \dots, i_{t,L}\} : \sigma_1(i) < \dots < \sigma_1(j),\ \sigma_2(i) > \dots > \sigma_2(j),\ \dots,\ \sigma_K(i) > \dots > \sigma_K(j)\}$$

Now, let $B_k$, $k = 1, \dots, K$ denote the event that none of
the $Q/L$ groups has a $\lambda$-approximately separable row, then,
following the same argument as in Lemma ??, we have,

Pr((J Bk) < K exp(-Qpi/L) < IT ex P (^ £2 ^_ 1 ) 

as an upper bound for the probability of $\beta$ not being
separable. We require $L = L(\phi, \lambda)$ such that $L\phi^{L-1} \le \lambda$. This
concludes the proof. □

C. Analysis of Proposed Algorithm 1 in the 
main paper 

Now we formally prove that if a ranking matrix $\beta$ is
$\lambda$-approximately separable, where $\lambda$ is small enough, the
proposed Algorithm 1 can consistently estimate the
reference rankings of the shared Mallows components.

Indexing convention: For convenience, for the rest of this
appendix we will index the $W = Q(Q - 1)$ rows of $B$ and
$E$ by just a single index $i$ instead of an ordered pair $(i,j)$
as in the main paper.

C.1. Consistency of Algorithm 2

Recall that $E = BY$. We decouple the effect of
$\lambda$-separability from the error in estimating $E$. Since the
second error converges to 0 as $M, N \to \infty$, we shall
focus on the perturbation of the solid angle that results from the
$\lambda$-approximate separability.

For $i$ being a $\lambda$-approximate novel row, let $E_i^0 = Y_k$ be the
corresponding row of $Y$. Otherwise, let $E_i^0 = E_i$ be the
rows of $E$. For each approximate novel row $i$, define the
original solid angle as,

$$q_i^0 = \Pr\left(\forall j : \|E_j^0 - E_i^0\| \ge d : E_i^0 u - E_j^0 u \ge 0\right) \qquad (13)$$

and define the $\lambda$-approximate solid angle as,

$$q_i = \Pr\left(\forall j : \|E_j - E_i\| \ge d : E_i u - E_j u \ge 0\right) \qquad (14)$$

for $i$ being a $\lambda$-approximately novel row. Therefore, for any
constant $c > 0$,

$$|q_i^0 - q_i| \le \Pr\left(\exists j, *, |E_i^0 u - E_j^0 u - E_i u + E_j u| \ge c\right) + \Pr\left(\forall j, *, |E_i^0 u - E_j^0 u| \le c\right) \qquad (15)$$

where we have replaced the distance constraints with $*$ for
convenience. We note that $E_j^0 = \sum_{k=1}^{K} B_{jk} Y_k$. Without
loss of generality, assume that $i$ is a $\lambda$-approximate novel
row for $Y_1$, then $E_i^0 = Y_1$. Taking a closer look at the
second term in the above equation, we have,

$$|E_i^0 u - E_j^0 u| = \Big| \sum_{k=2}^{K} B_{jk} (Y_k - Y_1) u \Big| \le \sum_{k=2}^{K} B_{jk} \left| (Y_k - Y_1) u \right|$$

Note also that $Y_k$, $k = 2, \dots, K$ are among the $E_j^0$'s;
therefore, the second term in (15) is equivalent to
$\Pr(\forall k = 2, \dots, K, |(Y_k - Y_1)u| \le c)$, hence by union bounding,
we have,

$$\Pr\left(\forall j, *, |E_i^0 u - E_j^0 u| \le c\right) \le \sum_{k=2}^{K} \Pr\left(|(Y_k - Y_1)u| \le c\right)$$


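Note that $(Y_k - Y_1)u \sim \mathcal{N}(0, \|Y_k - Y_1\|^2)$; by the
property of the Gaussian distribution, we have,

$$\Pr\left(|(Y_k - Y_1)u| \le c\right) = \int_{-c}^{c} \frac{1}{\sqrt{2\pi}\,\|Y_k - Y_1\|}\, e^{-t^2 / (2\|Y_k - Y_1\|^2)}\, dt \le \frac{c}{\|Y_k - Y_1\|}$$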

and is one of the j’s in (14), hence j correspond to another 
topic. Therefore, by the same argument, 

K M 

||E° - Ei|| = ||Yr - Y 5 * Y *H ^ E B *W Y ' - Y fcll < 

k =1 k=2 

M 

— A E II Yi — Yfc || 

k—2 

Combining the steps together, for Eq. (15), we require, 

\q° - ?i| < —-— + IPexp(-[—^-] 2 ) < q A / 3 

Pmin AI\ p ma x 

where $q_\wedge$ is the minimum solid angle of $Y$. This is required
so that the estimated solid angle for the $\lambda$-approximate
novel rows is well-separated from the solid angle of the
remaining non-novel rows. Recall that $\rho_{\min}$ and $\rho_{\max}$ are
defined as the minimum and maximum values of $\|Y_i -
Y_j\|$, $1 \le i, j \le K$. To parse the above equation, we set
c = 9A 3 P j^ in and therefore, we require 

^ ^ <?APmin ^ q/\K 

~~ 3fT 2 p maxV / log(VP/g A ) ~~ 3AT 2 v /log(lY/g A ) 

We can now apply the same argument to the other rows $i$
whose $d$-neighborhood does not enclose a novel word. We thus
require $d \ge 12\lambda K \sqrt{\log(W/q_\wedge)} / q_\wedge$. To combine the two
results, we can set

$$d = O(\kappa/K) \qquad (16)$$

To summarize the discussion, we have,

Proposition 4. If $\lambda$ is small enough such that,

$$\lambda \le \frac{q_\wedge \kappa}{3K^2 \sqrt{\log(W/q_\wedge)}} \qquad (17)$$

with $d$ set as in (16), then, for $M, N \to \infty$ and the number
of projections $P \to \infty$, the proposed algorithm can find
$O\left(2K\sqrt{\log(W/q_\wedge)}\,\lambda / q_\wedge\right)$-approximately novel rows for
$K$ distinct topics.


C.2. Consistency of Algorithm 3 


For now we denote by p m j n the minimum of ||Yfc — Y;||, 
therefore, the second term in (15) can be upper-bound by 

Pmin 

For the first term in (15), let $e_{i,j} = E_i^0 - E_j^0 - E_i + E_j$
and note that $e_{i,j} u \sim \mathcal{N}(0, \|e_{i,j}\|^2)$, then,

$$\Pr(|e_{i,j} u| \ge c) = 2Q(c/\|e_{i,j}\|) \le \exp\left(-c^2 / (2\|e_{i,j}\|^2)\right)$$

Further, $\|e_{i,j}\| \le \|E_i^0 - E_i\| + \|E_j^0 - E_j\|$. For $j$ which is
not a $\lambda$-approximate novel row and is one of the $j$'s in (14),
$\|E_j^0 - E_j\| = 0$. For $j$ being a $\lambda$-approximate novel row

We now consider the error accumulated in the steps of
Algorithm 3 in the main paper. Assuming that Algorithm 2 is
correct, we obtain $K$ row vectors, $E_j$, $j = 1, \dots, K$, as
$\lambda$-approximate novel pairs for the $K$ distinct Mallows
components. Without loss of generality, $E_j$ is approximately novel
to the $j$-th Mallows component ($j$-th column). We further
denote by $E_j^0$ the ideal extreme points, i.e., $E_j^0 = Y_j$ for
$j = 1, \dots, K$. Note that by definition,

$$E_i = \sum_{k=1}^{K} B_{ik} Y_k$$
k =1 
For $i = 1, \dots, K$, $k \ne i$, we have $B_{ik} \le \lambda B_{ii}$. $B$ is a
row-stochastic matrix. For $i = 1, \dots, W$, the corresponding
row vector $B_i$ is the optimal solution of the following
constrained linear regression,

$$b^* = \arg\min_{b_j \ge 0,\ \sum_j b_j = 1} \Big\| E_i - \sum_{j=1}^{K} b_j E_j^0 \Big\|$$

Now consider the empirical version we have access to,
which is,

$$\hat{b}^* = \arg\min_{b_j \ge 0,\ \sum_j b_j = 1} \Big\| \hat{E}_i - \sum_{j=1}^{K} b_j \hat{E}_j \Big\|$$

To bound the error between $b^*$ and $\hat{b}^*$ due to approximate
separability, we can establish the following property:


Proposition 5. Suppose that for $j = 1, \dots, K$, $\|\hat{E}_j -
E_j^0\|_2 \le \delta_1$ and $\|\hat{E}_i - E_i\|_2 \le \delta_2$ for a fixed $i$. Assume also that
$E_j$, $j = 1, \dots, K$ are at most $\lambda$-approximately separable
and $(K - 1)\lambda < 1$, then,

$$\|\hat{b}^* - b^*\|_2 \le \frac{4(\delta_1 + \delta_2)}{(1 - (K-1)\lambda)\,\lambda_{\min}}$$

where $\lambda_{\min}$ denotes the minimum eigenvalue of $R$.


K 

Proof Let /(E°,b) = ||E, - X] E°|| for any b and 

3=1 

note that for the optimal solution b*, /(E°, b*) = 0. Let 
Y = [E? t , ..., E§X] T , we have, 

K 

/(E,b)-/(E,b*) = ||E i -E^E5||-0 

3=1 

=11 X> - ^*)E°|| = ^/(b-b*)YYT(b-b*)T 
3=1 

>||b-b*||A min , F 

Recall that Y = RB t and let B T = [Bk, BA where 
the K x K Bk are approximately separable. Note that 
BB < A and A (K — 1) < 1, then, by the 
Gershgorin circle theorem, the minimum eigenvalue of Bk 
is lower-bounded by EEi)a > ■ Therefore, 

A m in,r > A m i n 1 ~( a 2 ~ 1 ) a where A m in is the minimum 
eigenvalue of R. Next, note that for any probability vec¬ 
tor b. 


Combining the above inequalities, we obtain, 

||b* - b*|| <E—{/(E, b*) - /(E, b*)} 

A m in,F 

E ^ /( g ,b*) 

-/(E,b*) + /(E,b*)-/(E,b*)} 
< 3 -^—{/(E,b*)-/(E,b*) 

^min? * 

+ /(E,b*)-/(E,b*)} 

“A min (l- A(tf - 1)) (<Jl + 

□ 


C.3. Consistency of Algorithm 4 


We first consider the row normalization step in Algorithm 


3. Note that, b*(i,j) k = B (ij)k = • 

define the row-scaling factor. 


We 


Pi 


,3 ~ 5 Z X(i,j),m /+ E 


and by definition Pij — > X] 0 (i,j),l a l < 1 as M — > 00 . 
If we define +- Pi,jb*(i,j) k as intermediate step, 

and then compute C(i,j),k/(, c (i,j),k + CQyyf). Note that 
c {i,j),k = P(i,j).kAk in the ideal case, in order to learn the 
hidden ranking correctly, we only need cuj\ k /{cuj\ tk + 
c (j,i),k) = Pu,j),k to remain in the correct interval of either 
[0, 0.5] or [0.5,1]. Therefore, the error in estimating C(ij), k 
should satisfy. 


^(i,j),k | — n&|0.5 P{i,j),k I 

Recall that can be estimated much accurate than b*, 
Therefore, we can consider the error in c as the result of 
error in b*. Note that the minimum of the |0.5 — Pu,j),k I i s 
achieved if the position of item i.j in the reference ranking 
are next to each other and |0.5 — <J{i,j),k I > 2 ( 1 + 0 ) — 
(1 — 0)/4. Therefore, we require, 

I b*(i,j)k - b*(i, j)k\Pi,j < Ofc( 1 ^ 0)/4 

Let a m j n = minafe and note that J < 1, using result in 
Prop. 5, we require. 


Si + 82 < amin Amin (1 - (K - 1)A)(1 - f )/8 (18) 


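$$\begin{aligned}
|f(E^0, b) - f(\hat{E}, b)| &\le \Big\| E_i - \hat{E}_i + \sum_j b_j (\hat{E}_j - E_j^0) \Big\| \\
&\le \|E_i - \hat{E}_i\| + \sum_j b_j \|\hat{E}_j - E_j^0\| \\
&\le \delta_2 + \delta_1
\end{aligned}$$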

Now, we express $\delta_1$ and $\delta_2$ in terms of $\lambda$. Note that $\delta_2 =
\|\hat{E}_i - E_i\|$ and $\delta_1 = \|\hat{E}_j - E_j^0\| \le \|\hat{E}_j - E_j\| + \|E_j^0 - E_j\|$.

$\delta_2$ and the first term in $\delta_1$ converge to 0 exponentially in
$M, N$ and do not depend on $\lambda$. Hence we focus on the
term $\|E_j^0 - E_j\|$.
Note that $\|E_j^0 - E_j\| = \big\| \sum_{k \ne j} B_{jk} E_k^0 - (1 - B_{jj}) E_j^0 \big\|$.
Let $v = [-(1 - B_{11}), B_{12}, \dots, B_{1K}]$ (wlog, consider $j =
1$), then $\|E_j^0 - E_j\| \le \|v\| \lambda_{\max,Y}$. Following the same
steps as in Prop. 5 and denoting by $\lambda_{\max}$ the maximum
eigenvalue of $R$, we have $\lambda_{\max,Y} \le (1 + (K-1)\lambda)\lambda_{\max}$,
and $\|v\| \le \lambda(K-1)/(1 + (K-1)\lambda)$. Combining the
results, we have,

$$\|E_j^0 - E_j\| \le \lambda (K-1) \lambda_{\max}$$

Let’s consider K A€l and using all the results above, we 
need, 

^ ^ Ctmin Amin(l 

8 KX 

max 

Formally, to combine the above two sections for Algorithm 
3 and 4, we have, 

Proposition 6. Assume $K$ rows that are $\lambda$-approximately novel
pairs for $K$ distinct Mallows components are selected. The
remaining steps, i.e., constrained linear regression,
row-scaling, and post-processing, can recover the true reference
rankings of all Mallows components when $M \to \infty$ and

$$\lambda \le \frac{a_{\min}\,\kappa\,(1 - \phi)}{8(K-1)}$$

where $a_{\min} = \min_k a_k$, $\kappa = \lambda_{\min}/\lambda_{\max} > 0$ is the
condition number of $R$, and $\phi = \max_k \phi_k < 1$.



Proof. First note that $B_{(i,j),k} = \frac{\beta_{(i,j),k}\, a_k}{\sum_{l} \beta_{(i,j),l}\, a_l}$. Therefore, if
$\beta$ is $\lambda$-approximately separable, then $B$ is at most
$a_0\lambda$-approximately separable.

Now, assuming that $\lambda a_0 \le \frac{q_\wedge \kappa}{3K^2\sqrt{\log(W/q_\wedge)}}$, by
Proposition 4, the novel word step via random projection can select
roughly $c_1 K \lambda a_0 / q_\wedge$-approximately separable novel words
if $M, N \to \infty$ and $P \to \infty$.

Now applying Proposition 6, we require $c_1 K \lambda a_0 / q_\wedge \le
\frac{a_{\min}\kappa(1 - \phi)}{8K}$, therefore,

$$\lambda \le \frac{a_{\min}\,\kappa\,(1 - \phi)\,q_\wedge}{8 c_1 K^2 a_0} = \frac{a_{\min}\,\kappa\,(1 - \phi)\,q_\wedge}{8 K^2 a_0 \sqrt{\log(W/q_\wedge)}}$$

Note that this is stronger than the previous constraints. In sum,
given these constraints, and letting $M, P \to \infty$, the estimation
of the center rankings is consistent.

The sample complexity follows directly from the results in
(Ding et al., 2014a) except for the constants. □


C.4. Overall sample complexity of Algorithm 1 via
random projection

We can directly combine the results from Prop. 4, 5 and 6
to obtain the consistency results for the overall algorithm.

Theorem 2 (in the main paper) Let the ranking matrix $\beta$ be
$\lambda$-approximately separable and the second-order moment matrix $R$
of the ranking prior be full rank. If

$$\lambda \le \frac{a_{\min}\,\kappa\,(1 - \phi)\,q_\wedge}{8 K^2 a_0 \sqrt{\log(W/q_\wedge)}}$$

and $M, P \to \infty$, then Algorithm 1 can consistently
recover all the reference rankings of the latent Mallows
distributions. Moreover, $\forall \delta > 0$, if


f 640FF 2 log(3W'/<5) Z26W \og{ZW/ 5) | 

“ mdX j Nifdfql ’ A^ 4 Ai in < in (l - f ) 2 j 

and for 

<Za 

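the proposed algorithm fails with probability at most $\delta$.
The other model parameters are defined as follows: $\eta =
\min_{1 \le w \le W} [Ba]_w$; $a_{\max}$, $a_{\min}$ are the max/min of entries
of $a$; $a_0 = \max_{i,j} a_i/a_j$; $Y = RB$; $\kappa = \lambda_{\min}/\lambda_{\max}$ is
the condition number of $R$; $q_\wedge$ is the minimum normalized
solid angle formed by row vectors of $Y$; $d = 6\kappa/K$;
$\phi_k \le \phi < 1$. $N$ is the number of comparisons of each user.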
















